ArchR takes as input aligned BAM or fragment files. These files are stored in the HDF5 file format (hierarchical data format version5). The HDF5 files are the constituent pieces of an ArchR analysis. They are called Arrow files. All Arrow files are grouped into a project, a compressed R data file. The Files are accessed in minimal chunks (parallel read and write operations). Therefore, in memory we do not have any large file sizes.
To select high quality cells TSS enrichment scores are used.
library(ArchR)
library(knitr)
library(rhdf5)
library(uwot)
library(tidyverse)
#library(caret)
h5disableFileLocking()
inputFiles <- getTutorialData("Hematopoiesis")
addArchRGenome("hg19")
ArrowFiles <- createArrowFiles(
inputFiles = inputFiles,
sampleNames = names(inputFiles),
minTSS = 4, #Dont set this too high because you can always increase later
minFrags = 1000, # minimum number of mapped fragments required
# count matrix, instead of using peak it uses fixed-width sliding window of bins across the whole genome
addTileMat = TRUE,
addGeneScoreMat = TRUE, # uses signal proximal to the TSS to estimate gene activity
subThreading = FALSE,
maxFrags = 1e+05,
minFragSize = 10,
maxFragSize = 2000,
QCDir = "QualityControl",
# the length in bp that wraps around nucleosomes ->
#identify fragments as sub-nucleosome spanning, mono-nucleosome spanning or multi-nucleosome spanning
nucLength = 147,
# integer vector -> define region up/downstream of TSS to include as promoter region
# can be used to calculate e.g fraction of reads in promoter (FIP)
promoterRegion = c(2000, 100),
# parameters for computing TSS enrichment scores, window (bp) centered at TSS = 101
# flanking window = 2000 bp
# norm = size of flank window used for normalization = 100 bp
# accessibility within 101 bp surrounding the TSS will be normalized to accessibility
# in 100 bp bins from -2000:-1901bp and 1901: 2000
TSSParams = list(101, 2000, 100),
# which chromosomes to exclude form downstream analysis
# in human and mouse: mitochondrial DNA (chrM) and male sex chromosome (chrY)
# the fragments are still stored in the arrow files
excludeChr = c("chrM", "chrY"),
# number of chunks to divide chromosomes in -> low-memory parallelized reading of input files
nChunk = 5,
# name of field in input bam file containing the barcode tag information
bcTag = "qname",
offsetPlus = 4, # offset applied to + stranded Tn5 insertion -> account for precise Tn5 binding site
offsetMinus = -5,
logFile = createLogFile("createArrows")
)
# the arrow files object simply contains a character vector of arrow file paths
ArrowFiles
## [1] "scATAC_BMMC_R1.arrow" "scATAC_CD34_BMMC_R1.arrow"
## [3] "scATAC_PBMC_R1.arrow"
Have a look at this for additional QC! https://bioconductor.org/packages/devel/bioc/vignettes/ATACseqQC/inst/doc/ATACseqQC.html
= signal to noise calculation
ATAC-seq data is enriched at TSS regions compared to other genomic regions, because large protein complexes bind to promoters. Therefore, there is a local enrichment of per-basepair accessibility relative to flanking regions. The score is approximated for each cell. The average accessibility within 50 bp of the TSS is divided by the average accessibility of the TSS flanking positions.
The reads around a particular TSS are collected and aggregated to form a distribution of reads which is centered on the TSS. This distribution extends to 1000bp in both directions * we take the average read depth in the 100 bps at each end flank and calcualte a fold change at each position over that average read depth (total of 200bp of averaged data) * If there is a high read signal at TSS there will be an increase in signal towards the middle of the distribution * high signal means that the region is accessible * the signal value at the center of the distribution is the TSS enrichment metric
Should be removed, because they can interfere with downstream analysis.
Doublet detection-and-removal algorithm: Heterotypic doublets are identified by generating a collection of synthetic doublets. These synthetic doublets are projected into low-dimensional embeddings. Searching for nearest neighbours to the synthetic doublets we can identify doublets in the dataset. This outperforms the prediction of doublets using fragment number (ROC-AUC). (Compared to demuxlet as ground truth)
We can also identify doublets in the scRNA-seq space if we have paired data and remove the cells in this way.
# for each sample provided doublet information will be assigned to each cell
# this way we can remove doublet-based clusters downstream
doubScores <- addDoubletScores(
useMatrix = "TileMatrix",
input = ArrowFiles,
k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
nTrial = 5, # number of time to simulate nCell doublets
knnMethod = "UMAP", #Refers to the dimensionality reduciton method to use for nearest neighbor search.
LSIMethod = 1 # oder of normalization: tf-log(idf)
)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# the doublet information is saved in a simpleListobject
# read in the object
doublet_summary <- readRDS("QualityControl/scATAC_BMMC_R1/scATAC_BMMC_R1-Doublet-Summary.rds")
doublet_summary[[2]] %>% head() %>% kable()
| x | y | density | type | color | |
|---|---|---|---|---|---|
| doublet_14733 | 5.8904443 | -3.702020 | 0.0006007 | simulated_doublet | 0.0006007 |
| doublet_14467 | 5.5716429 | -3.579292 | 0.0006255 | simulated_doublet | 0.0006255 |
| doublet_4584 | 5.4819946 | -3.693586 | 0.0006264 | simulated_doublet | 0.0006264 |
| doublet_14409 | 0.6149653 | 3.274105 | 0.0009510 | simulated_doublet | 0.0009510 |
| doublet_12622 | 6.3805246 | -3.755051 | 0.0010196 | simulated_doublet | 0.0010196 |
| doublet_5702 | 5.2130489 | -3.519199 | 0.0010345 | simulated_doublet | 0.0010345 |
# get first entry of the list = originalDataUMAP
p1 <- doublet_summary [[1]] %>%
ggplot() +
geom_point(aes(x = X1, y = X2, col = enrichment), size = .1) +
scale_color_viridis_c() +
guides(fill=guide_legend(title="DoubletEnrichment")) +
labs(title = "Simulated Doublet Enrichment over expectation")
p2 <- doublet_summary [[1]] %>%
ggplot() +
geom_point(aes(x = X1, y = X2, col = score), size = .1) +
scale_color_viridis_c() +
guides(fill=guide_legend(title="DoubletScores -log10(P-adj)")) +
labs(title = "Doublet Scores -log10(P-adj)")
p3 <- doublet_summary[[2]] %>%
ggplot() +
geom_point(aes(x = x, y = y, col = density), size = .1) +
scale_color_viridis_c() +
guides(fill=guide_legend(title="Simulated Doublet Density")) +
labs(title = "Doublet density")
p4 <- doublet_summary[[2]] %>%
ggplot() +
geom_point(aes(x = x, y = y, col = density), size = .1) +
geom_point(data = doublet_summary[[1]], aes(x = X1, y = X2), size = .1, alpha = .4) +
scale_color_viridis_c() +
guides(fill=guide_legend(title="Simulated Doublet Density")) +
labs(title = "Simulated doublet density overlayed")
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)
An ArchR Project is initialized with some important attributes:
sampleColData -> matrix containint data for each samplecellColData -> contains data associated with each cell
addDoubletScore() there will be a column for “Doublet Enrichment” and “Doublet Score”proj <- ArchRProject(
ArrowFiles = ArrowFiles,
outputDirectory = "ArchRVignette",
copyArrows = TRUE, #This is recommened so that you maintain an unaltered copy for later usage.
geneAnnotation = getGeneAnnotation(),
#genomeAnnotation = getGeneAnnotation(),
showLogo = FALSE
)
Cell Names:
# cell names
proj$cellNames[1:5]
## [1] "scATAC_BMMC_R1#TTATGTCAGTGATTAG-1" "scATAC_BMMC_R1#AAGATAGTCACCGCGA-1"
## [3] "scATAC_BMMC_R1#GCATTGAAGATTCCGT-1" "scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1"
## [5] "scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1"
Sample Names:
proj$Sample[1:5]
## [1] "scATAC_BMMC_R1" "scATAC_BMMC_R1" "scATAC_BMMC_R1" "scATAC_BMMC_R1"
## [5] "scATAC_BMMC_R1"
length(proj$Sample)
## [1] 10660
For each cell there is a TSS enrichment score: Is this the average TSS enrichment score for each cell?
# TSS Enrichment score
proj$TSSEnrichment %>% head
## [1] 7.204 7.949 4.447 6.941 4.771 9.185
quantile(proj$TSSEnrichment)
## 0% 25% 50% 75% 100%
## 4.1090 13.9255 16.8150 19.9285 41.9800
hist(proj$TSSEnrichment, xlab = "TSS enrichment score", main = "TSS Enrichment Score", breaks = 100)
# reads in TSS
hist(proj$ReadsInTSS, xlab = "Reads in TSS", main = "Reads in TSS", breaks = 100)
hist(proj$ReadsInPromoter, xlab = "Reads in Promoter", main = "Reads in Promoter", breaks = 100)
hist(proj$ReadsInBlacklist, xlab = "Reads in Blcklist", main ="Reads in Blacklist", breaks = 100)
number of fragments:
# number of fragmetns for each cell
proj$nFrags %>% head
## [1] 26189 20648 18991 18296 17458 17109
quantile(proj$nFrags)
## 0% 25% 50% 75% 100%
## 1001.00 2150.00 3049.50 4332.25 91194.00
p1 <- ggplot(as_tibble(proj$nFrags), aes(x = value)) + geom_histogram(binwidth = 100) +
labs(title = "number of unique fragments") +
xlab("number of unique fragments")
p2 <- ggplot(as_tibble(proj$nMultiFrags), aes(x = value)) + geom_histogram(binwidth = 100) +
labs(title = "number of mulit-nucleosome fragments") +
xlab("number of mulit-nucleosome fragments")
p3 <- ggplot(as_tibble(proj$nMonoFrags), aes(x = value)) + geom_histogram(binwidth = 100) +
labs(title = "number of mono-nucleosome fragments") +
xlab("number of mono-nucelosome fragments")
p4 <- ggplot(as_tibble(proj$nDiFrags), aes(x = value)) + geom_histogram(binwidth = 100) +
labs(title = "number of Di-nucleosome fragments") +
xlab("number Di-fragments")
gridExtra::grid.arrange(p1, p2, p3,p4, ncol = 2)
# note that that only 100 cells are accessed
proj[1:100,]
## class: ArchRProject
## outputDirectory: /omics/groups/OE0533/internal/katharina/scDoRI/ArchR/ArchRVignette
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 100
## medianTSS(1): 10.794
## medianFrags(1): 10200.5
proj[proj$cellNames[1:100], ]
## class: ArchRProject
## outputDirectory: /omics/groups/OE0533/internal/katharina/scDoRI/ArchR/ArchRVignette
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 100
## medianTSS(1): 10.794
## medianFrags(1): 10200.5
idxSample <- BiocGenerics::which(proj$Sample %in% "scATAC_BMMC_R1")
cellsSample <- proj$cellNames[idxSample]
proj[cellsSample, ]
## class: ArchRProject
## outputDirectory: /omics/groups/OE0533/internal/katharina/scDoRI/ArchR/ArchRVignette
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 4932
## medianTSS(1): 15.2575
## medianFrags(1): 2771
idxPass <- which(proj$TSSEnrichment >= 8)
cellsPass <- proj$cellNames[idxPass]
proj[cellsPass, ]
## class: ArchRProject
## outputDirectory: /omics/groups/OE0533/internal/katharina/scDoRI/ArchR/ArchRVignette
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 10499
## medianTSS(1): 16.899
## medianFrags(1): 3042
getAvailableMatrices(proj)
## [1] "GeneScoreMatrix" "TileMatrix"
df <- getCellColData(proj, select = "nFrags")
df %>% head %>% kable()
| nFrags | |
|---|---|
| scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 | 26189 |
| scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 | 20648 |
| scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 | 18991 |
| scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 | 18296 |
| scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 | 17458 |
| scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 | 17109 |
df <- getCellColData(proj, select = c("log10(nFrags)", "nFrags -1 "))
df %>% head %>% kable()
| log10.nFrags. | nFrags..1. | |
|---|---|---|
| scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 | 4.418119 | 26188 |
| scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 | 4.314878 | 20647 |
| scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 | 4.278548 | 18990 |
| scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 | 4.262356 | 18295 |
| scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 | 4.241994 | 17457 |
| scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 | 4.233225 | 17108 |
Data before QC and corresponding plots are saved in the Quality Control output folder.
#create 3 separate dataframes for all samples
three_samples <- map(unique(proj$Sample), function(name){
index <- BiocGenerics::which(proj$Sample %in% name)
cells <- proj$cellNames[index]
sample_subset <- proj[cells]
df <- getCellColData(sample_subset, select = c("log10(nFrags)", "TSSEnrichment"))
p <- ggPoint(
x = df[, 1], y = df[, 2],
colorDensity = TRUE, # should the density of points on the plot be indicated by color?
continuousSet = "sambaNight",
xlabel = "Log10 unique fragments",
ylabel = "TSS enrichment",
title = paste0("Sample: ", name),
xlim = c(log10(500), quantile(df[,1], probs = 0.99)),
ylim = c(0, quantile(df[,2], probs = 0.99))
) + geom_hline(yintercept = 4, lty = "dashed") +
geom_vline(xintercept = 3, lty = "dashed")
list(plot = p, name = name)
})
gridExtra::grid.arrange(three_samples[[1]]$plot, three_samples[[2]]$plot,
three_samples[[3]]$plot, ncol = 3)
We want a TSS enrichment score of > 4 and a number of unique fragments > 1000 (log10(1000) = 3).
p1 <- as_data_frame(getCellColData(proj)) %>%
mutate(Sample = str_remove(Sample, "scATAC_")) %>%
ggplot() +
geom_density(aes(x = TSSEnrichment, fill = Sample), alpha = 0.8)
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
p2 <- as_data_frame(getCellColData(proj)) %>%
mutate(Sample = str_remove(Sample, "scATAC_")) %>%
ggplot() +
ggridges::geom_density_ridges(aes(x = TSSEnrichment, y = Sample,
fill = Sample), alpha = 0.8) +
theme(legend.position = "none")
p3 <- as_data_frame(getCellColData(proj)) %>%
mutate(Sample = str_remove(Sample, "scATAC_")) %>%
ggplot() +
geom_violin(aes(x = Sample, y = TSSEnrichment, fill = Sample), alpha = 0.8) +
geom_boxplot(aes(x = Sample, y = TSSEnrichment,fill = Sample), alpha = 0.4) +
theme(legend.position = "none") +
labs(title = "TSS Enrichment")
cowplot::plot_grid(p3, p2, p1, ncol = 3)
p1 <- as_data_frame(getCellColData(proj)) %>%
mutate(Sample = str_remove(Sample, "scATAC_"), log10_nFrags = log10(nFrags)) %>%
ggplot() +
geom_density(aes(x = log10_nFrags, fill = Sample), alpha = 0.8)
p2 <- as_data_frame(getCellColData(proj)) %>%
mutate(Sample = str_remove(Sample, "scATAC_"), log10_nFrags = log10(nFrags)) %>%
ggplot() +
ggridges::geom_density_ridges(aes(x = log10_nFrags, y = Sample,
fill = Sample), alpha = 0.8) +
theme(legend.position = "none")
p3 <- as_data_frame(getCellColData(proj)) %>%
mutate(Sample = str_remove(Sample, "scATAC_"), log10_nFrags = log10(nFrags)) %>%
ggplot() +
geom_violin(aes(x = Sample, y = log10_nFrags, fill = Sample), alpha = 0.8) +
geom_boxplot(aes(x = Sample, y = log10_nFrags,fill = Sample), alpha = 0.4) +
theme(legend.position = "none") +
labs(title = "number of fragments")
cowplot::plot_grid(p3, p2, p1, ncol = 3)
p1 <- plotFragmentSizes(ArchRProj = proj)
p2 <- plotTSSEnrichment(ArchRProj = proj)
ggAlignPlots(p1, p2, type = "v")
With the function addDoubleScores() information on predicted doublets has been added. Filter the putative doublets. They are not removed physically, but excluded from downstream analysis. ArchR automatically prints the number of cells removed from each sample and the corresponding percentage which is very handy.
arguments:
maximum would be filterRatio * 5000 / 100000 = filterRatio * 5000 * 0.05
This way samples with different percentage of doublets will be filtered accordingly.
# in our case we now have 10 251 cells as opposed to 10 661 cells before
# filtering -> 410 cells were removed (3.85%)
proj <- filterDoublets(ArchRProj = proj)
Because we can have maximally two accessible alleles per cell, the scATAC-seq data is sparse. Therefore, the majority of accessible regions are not transposed, meaning that most loci will have 0 accessible alleles. A zero could mean “non-accessible” or “not sampled”. For many analysis we can use a binarized matix. Imporantly, the 1s have information, BUT the 0s do not!
A PCA would result in high inter-cell similarity at all 0 positions. An alternative approach for dimensionality reduction is a layered dimensionality reduction. First, Latent Semantic Indexing (LSI) is used. LSI is an approach from language processing. Different samples are the “documents” and different regions/peaks are the “words”.
\(TF = \frac{C_{ij}}{F_{j}}\) with \(C_{ij}\) being the total number of counts for peak i in cell j and \(F_{j}\) being the total number of counts in cell j.
\(IDF = \frac{N}{n_{p}}\) with N being the total number of cells in the dataset and \(n_{p}\) being the total number of coutns for peak i across all cells.
\(TF-IDF = \log{1 + (TF * IDF) 10^{4}}\)
With LSI we can reduce the dimensionality of the sparse insertion matrix to tens or hundreds. Then UMAP or t-SNE can be used to visualize the data
Unlike in scRNA-seq we cannot select the top highly variable features before dimensionality reduction (high noise, low reproducibility). Rather the iterative LSI approach first computes a LSI on the most accessible tiles (this will identify clusters corresponding to the major cell types). Then, ArchR computes the average accessibility across these clusters across all features. Next, the most variable peaks across these clusters are identified. The most highly accessible peaks are the features of a new round of LSI. We can set how many rounds of LSI we want to be peformed.
Using iterative LSI reduces batch effects. If you see some batch effects you could try to add more LSI iterations and start from a lower initial clustering resolution. Also, the number of variable features can be lowered.
The LSI transformation is, therefore, based on a subset of cells only and the remaining cells are projected into this landmark LSI. This approach minimizes memory usage. The landmark set size depends on cell type proportions.
proj <- addIterativeLSI(ArchRProj = proj, useMatrix = "TileMatrix", name = "IterativeLSI")
proj <- addIterativeLSI(
ArchRProj = proj,
useMatrix = "TileMatrix",
name = "IterativeLSI",
iterations = 2,
clusterParams = list( #See Seurat::FindClusters
resolution = c(0.2),
sampleCells = 10000,
n.start = 10
),
varFeatures = 25000,
dimsToUse = 1:30
)
Calling clusters in this new space uses the Seurat’s graph clustering function as default clustering method. The Seurat method first computs KNN graph and then a modularity optimization technique to cluster the cells (iteratively group cells together with Louvian algorithm using 10 random starts). Another option is to use “Scran”. The default number of nearest neighbors used is 10. The minimum number of cells for a cluster to be called a cluster is set to 5 by default. The maximum number of clusters to be called is set to 25 by default.
proj <- addClusters(input = proj, reducedDims = "IterativeLSI")
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10250
## Number of edges: 473760
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8640
## Number of communities: 13
## Elapsed time: 1 seconds
cellColDatacolorBy and name parameterproj <- addUMAP(ArchRProj = proj, reducedDims = "IterativeLSI")
## Warning in sprintf(gettext(fmt, domain = domain), ...): one argument not used by
## format 'invalid uid value replaced by that for user 'nobody''
## Warning: invalid uid value replaced by that for user 'nobody'
## Warning in sprintf(gettext(fmt, domain = domain), ...): one argument not used by
## format 'invalid gid value replaced by that for user 'nobody''
## Warning: invalid gid value replaced by that for user 'nobody'
df <- as_data_frame(cbind(getCellColData(proj), getEmbedding(proj)) ) %>%
rename(c(umap1 = IterativeLSI.UMAP_Dimension_1, umap2 = IterativeLSI.UMAP_Dimension_2))
variables <- c("Clusters", "Sample", "nFrags", "DoubletScore")
plots1 <- map(c("Clusters", "Sample"), function(n){
ggplot() +
geom_point(aes(x = df %>% pull("umap1"), y = df %>% pull("umap2"),
col = df %>% pull(n)), size = .04) +
guides(col=guide_legend(title=paste0(n))) +
xlab("umap1") +
ylab("umpa2") +
labs(title = paste0(n))
})
plots2 <- map(c("nFrags", "DoubletScore"), function(n){
ggplot() +
geom_point(aes(x = df %>% pull("umap1"), y = df %>% pull("umap2"),
col = df %>% pull(n)), size = .04) +
scale_color_viridis_c() +
guides(fill=guide_legend(title=paste0(n))) +
xlab("umap1") +
ylab("umpa2") +
labs(title = paste0(n))
})
do.call(gridExtra::grid.arrange, c(plots1, ncol=2))#, nrow = 2))
```#{r, fig.width=8} # color by sample p1 <- plotEmbedding(ArchRProj = proj, colorBy = “cellColData”, name = “Sample”, embedding = “UMAP”)
p2 <- plotEmbedding(ArchRProj = proj, colorBy = “cellColData”, name = “Clusters”, embedding = “UMAP”)
ggAlignPlots(p1, p2, type = “h”)
# Cluster assignment using gene scores
For the toy dataset marker genes of known hematopoietic regulators
can be used. Using MAGIC we add imputation weights to smooth the dropout noise in the gene scores
```r
proj <- addImputeWeights(proj)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
markerGenes <- c(
"CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", "MME", #B-Cell Trajectory
"CD14", "MPO", #Monocytes
"CD3D", "CD8A"#TCells
)
p <- plotEmbedding(
ArchRProj = proj,
colorBy = "GeneScoreMatrix",
name = markerGenes,
embedding = "UMAP",
imputeWeights = getImputeWeights(proj)
)
do.call(gridExtra::grid.arrange, c(p, ncol = 3))
Browse local chromatin accessibility at marker genes. Plot genome browser tracks per cluster
p <- plotBrowserTrack(
ArchRProj = proj,
groupBy = "Clusters",
geneSymbol = markerGenes,
upstream = 50000,
downstream = 50000
)
## GRanges object with 9 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 208059883-208084683 - | 947 CD34
## [2] chrX 48644982-48652717 + | 2623 GATA1
## [3] chr9 36838531-37034476 - | 5079 PAX5
## [4] chr11 60223282-60238225 + | 931 MS4A1
## [5] chr3 154741913-154901518 + | 4311 MME
## [6] chr5 140011313-140013286 - | 929 CD14
## [7] chr17 56347217-56358296 - | 4353 MPO
## [8] chr11 118209789-118213459 - | 915 CD3D
## [9] chr2 87011728-87035519 - | 925 CD8A
## -------
## seqinfo: 24 sequences from hg19 genome
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
grid::grid.newpage()
grid::grid.draw(p$CD14)
FindTransferAnchors() function from Seurat is usedApart from using this information for identifying clusters we can also use it for identifying predicted cis-regulatory elements.
if(!file.exists("scRNA-Hematopoiesis-Granja-2019.rds")){
download.file(
url = "https://jeffgranja.s3.amazonaws.com/ArchR/TestData/scRNA-Hematopoiesis-Granja-2019.rds",
destfile = "scRNA-Hematopoiesis-Granja-2019.rds"
)
}
# ranged summarized Experiment
seRNA <- readRDS("scRNA-Hematopoiesis-Granja-2019.rds")
seRNA
## class: RangedSummarizedExperiment
## dim: 20287 35582
## metadata(0):
## assays(1): counts
## rownames(20287): FAM138A OR4F5 ... S100B PRMT2
## rowData names(3): gene_name gene_id exonLength
## colnames(35582): CD34_32_R5:AAACCTGAGTATCGAA-1
## CD34_32_R5:AAACCTGAGTCGTTTG-1 ...
## BMMC_10x_GREENLEAF_REP2:TTTGTTGCATGTGTCA-1
## BMMC_10x_GREENLEAF_REP2:TTTGTTGCATTGAAAG-1
## colData names(10): Group nUMI_pre ... BioClassification Barcode
Lets have a look at the count matrix:
# sparse count matrix of scRNA-seq
assays(seRNA)[[1]][1:10, 1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
## CD34_32_R5:AAACCTGAGTATCGAA-1 CD34_32_R5:AAACCTGAGTCGTTTG-1
## FAM138A . .
## OR4F5 . .
## AL627309.1 . .
## OR4F29 . .
## OR4F16 . .
## FAM87B . .
## LINC00115 . .
## FAM41C 1 .
## AL645608.2 . .
## SAMD11 . .
## CD34_32_R5:AAACCTGGTTCCACAA-1 CD34_32_R5:AAACGGGAGCTTCGCG-1
## FAM138A . .
## OR4F5 . .
## AL627309.1 . .
## OR4F29 . .
## OR4F16 . .
## FAM87B . .
## LINC00115 . .
## FAM41C 1 .
## AL645608.2 . .
## SAMD11 . .
## CD34_32_R5:AAACGGGAGGGAGTAA-1
## FAM138A .
## OR4F5 .
## AL627309.1 .
## OR4F29 .
## OR4F16 .
## FAM87B .
## LINC00115 .
## FAM41C .
## AL645608.2 .
## SAMD11 .
Metadata of the scRNA-seq dataset:
We already have clustering, umap embeddings and cell types.
colData(seRNA) %>% head %>% knitr::kable()
| Group | nUMI_pre | nUMI | nGene | initialClusters | UMAP1 | UMAP2 | Clusters | BioClassification | Barcode | |
|---|---|---|---|---|---|---|---|---|---|---|
| CD34_32_R5:AAACCTGAGTATCGAA-1 | CD34_D2T1 | 17876 | 8303 | 3187 | Cluster1 | -6.113410 | 4.616498 | Cluster5 | 05_CMP.LMPP | AAACCTGAGTATCGAA-1 |
| CD34_32_R5:AAACCTGAGTCGTTTG-1 | CD34_D2T1 | 9277 | 3917 | 1787 | Cluster2 | -8.800932 | -1.228907 | Cluster8 | 08_GMP.Neut | AAACCTGAGTCGTTTG-1 |
| CD34_32_R5:AAACCTGGTTCCACAA-1 | CD34_D2T1 | 13073 | 6023 | 2552 | Cluster3 | -9.723482 | 7.335178 | Cluster1 | 01_HSC | AAACCTGGTTCCACAA-1 |
| CD34_32_R5:AAACGGGAGCTTCGCG-1 | CD34_D2T1 | 8412 | 4493 | 2191 | Cluster4 | -4.293071 | 5.692705 | Cluster6 | 06_CLP.1 | AAACGGGAGCTTCGCG-1 |
| CD34_32_R5:AAACGGGAGGGAGTAA-1 | CD34_D2T1 | 11914 | 5190 | 2322 | Cluster3 | -7.989706 | 9.108693 | Cluster1 | 01_HSC | AAACGGGAGGGAGTAA-1 |
| CD34_32_R5:AAACGGGAGTTACGGG-1 | CD34_D2T1 | 12075 | 4634 | 2152 | Cluster3 | -7.271959 | 6.980415 | Cluster1 | 01_HSC | AAACGGGAGTTACGGG-1 |
Plot Quality Metrics of the scRNA-seq dataset:
as_data_frame(colData(seRNA)) %>%
select(nUMI, nGene, Group) %>%
pivot_longer(cols = !Group, names_to = "stat") %>%
ggplot() +
geom_violin(aes(x = Group, y = value, fill = Group)) +
facet_wrap(~stat, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
xlab("Sample")
#pivot_longer(cols = !s)
df <- as_data_frame(colData(seRNA))
p1 <- ggplot() +
geom_point(aes(x = df %>% pull("UMAP1"), y = df %>% pull("UMAP2"),
col = df %>% pull("BioClassification")), size = .04) +
guides(col=guide_legend(title="CellType")) +
xlab("umap1") +
ylab("umpa2") +
labs(title = "scRNA-seq dataset - cell type")
p2 <- ggplot() +
geom_point(aes(x = df %>% pull("UMAP1"), y = df %>% pull("UMAP2"),
col = df %>% pull("nGene")), size = .04) +
guides(col=guide_legend(title="number of genes")) +
scale_color_viridis_c() +
xlab("umap1") +
ylab("umpa2") +
labs(title = "scRNA-seq dataset - gene number")
p3 <- df %>% ggplot() +
geom_point(aes(x = UMAP1, y = UMAP2, col = Group), size = .04) +
guides(col=guide_legend(title="Sample")) +
xlab("umap1") +
ylab("umpa2") +
labs(title = "scRNA-seq dataset - sample")
p1
gridExtra::grid.arrange(p2, p3, ncol = 2)
table(colData(seRNA)$BioClassification)
##
## 01_HSC 02_Early.Eryth 03_Late.Eryth 04_Early.Baso 05_CMP.LMPP
## 1425 1653 446 111 2260
## 06_CLP.1 07_GMP 08_GMP.Neut 09_pDC 10_cDC
## 903 2097 1050 544 325
## 11_CD14.Mono.1 12_CD14.Mono.2 13_CD16.Mono 14_Unk 15_CLP.2
## 1800 4222 292 520 377
## 16_Pre.B 17_B 18_Plasma 19_CD8.N 20_CD4.N1
## 710 1711 62 1521 2470
## 21_CD4.N2 22_CD4.M 23_CD8.EM 24_CD8.CM 25_NK
## 2364 3539 796 2080 2143
## 26_Unk
## 161
p1 <- as_data_frame(getCellColData(proj)) %>%
mutate(Sample = str_remove(Sample, "scATAC_"))
atac <- as_data_frame(getCellColData(proj)) %>%
rownames_to_column("cell") %>%
mutate(cell = str_extract(rownames(getCellColData(proj)), "(?<=#)[^#]+"))
atac
rna <- as_data_frame(colData(seRNA)) %>%
rownames_to_column("cell") %>%
mutate(cell = str_extract(rownames(colData(seRNA)), "(?<=:)[^:]+")) %>%
filter(cell %in% atac[["cell"]])
cbind(atac["nFrags"], rna["nUMI"])
The integration matrix will be stored in the ArchR project via matrixName. The other parameters will be saved in the cellColData (nameCell = store matched cell ID from scRNA-seq, nameGroup = store group ID from scRNA-seq, nameScore = store cross-platform integration score)
proj <- addGeneIntegrationMatrix(
ArchRProj = proj,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = FALSE,
groupRNA = "BioClassification", # Bioclassification is a column in
# the colData of the seRNA object, this column will be used to
# determine the subgroupings specified in groupList for constrained integration
# it is also used for the nameGroup output of this function
nameCell = "predictedCell_Un", # name the cellColData for the
# predicted scRNA-seq cell in the specified ArchRProject ->
# will add a column to the project
nameGroup = "predictedGroup_Un",
nameScore = "predictedScore_Un"
)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
df <- as_data_frame(cbind(getCellColData(proj), getEmbedding(proj)) ) %>%
rename(c(umap1 = IterativeLSI.UMAP_Dimension_1, umap2 = IterativeLSI.UMAP_Dimension_2))
p1 <- df %>% ggplot() +
geom_point(aes(x = umap1, y = umap2, col = predictedGroup_Un), size = .04) +
guides(col=guide_legend(title="predicted cell type")) +
xlab("umap1") +
ylab("umpa2") +
labs(title = "Predicted cell types after unconstrained integration")
p2 <- df %>% ggplot() +
geom_point(aes(x = umap1, y = umap2, col = predictedScore_Un), size = .04) +
guides(col=guide_legend(title="prediction score")) +
scale_color_viridis_c() +
xlab("umap1") +
ylab("umpa2") +
labs(title = "Predicted cell types after unconstrained integration")
gridExtra::grid.arrange(p1, p2, ncol = 1)
# in the confusion matrix the rows correspond to clusters in scATAC data
# the columns correspond to cell types in scRNA-seq data
cM <- as.matrix(confusionMatrix(proj$Clusters, proj$predictedGroup_Un))
cM[, 1:5]
## 17_B 16_Pre.B 08_GMP.Neut 25_NK 11_CD14.Mono.1
## C4 407 1 0 1 0
## C2 3 330 0 2 0
## C10 3 0 276 0 74
## C6 0 0 0 570 0
## C8 0 0 6 0 1237
## C12 0 0 26 5 0
## C5 0 0 0 4 0
## C11 0 0 751 0 2
## C7 0 0 0 14 0
## C13 0 0 60 2 0
## C1 0 0 0 0 0
## C9 2 0 3 0 366
## C3 0 1 1 0 0
preClust <- colnames(cM)[apply(cM, 1 , which.max)]
cbind(preClust, rownames(cM)) #Assignments
## preClust
## [1,] "17_B" "C4"
## [2,] "16_Pre.B" "C2"
## [3,] "08_GMP.Neut" "C10"
## [4,] "25_NK" "C6"
## [5,] "11_CD14.Mono.1" "C8"
## [6,] "02_Early.Eryth" "C12"
## [7,] "21_CD4.N2" "C5"
## [8,] "08_GMP.Neut" "C11"
## [9,] "22_CD4.M" "C7"
## [10,] "01_HSC" "C13"
## [11,] "09_pDC" "C1"
## [12,] "12_CD14.Mono.2" "C9"
## [13,] "15_CLP.2" "C3"
Which cells in the scRNA-seq correspond to Tcells and NK cells?
Use pattern matching strings in combination with grep to extract the scATAC-seq clusters that correspond to these scRNA-seq cell types. | acts as an or statement, so we search for any row in the preClust column of the confusion matix that matches a scRNA-seq cluster number.
unique(proj$predictedGroup_Un)
## [1] "17_B" "16_Pre.B" "08_GMP.Neut" "25_NK"
## [5] "11_CD14.Mono.1" "07_GMP" "02_Early.Eryth" "24_CD8.CM"
## [9] "05_CMP.LMPP" "03_Late.Eryth" "22_CD4.M" "09_pDC"
## [13] "21_CD4.N2" "15_CLP.2" "01_HSC" "19_CD8.N"
## [17] "12_CD14.Mono.2" "20_CD4.N1" "06_CLP.1" "04_Early.Baso"
## [21] "26_Unk" "23_CD8.EM" "13_CD16.Mono"
cTNK <- paste0(paste0(19:25), collapse = "|")
cTNK
## [1] "19|20|21|22|23|24|25"
cNonTNK <- paste0(c(paste0("0", 1:9), 10:13, 15:18), collapse = "|")
cNonTNK
## [1] "01|02|03|04|05|06|07|08|09|10|11|12|13|15|16|17|18"
clustTNK <- rownames(cM)[grep(cTNK, preClust)]
clustTNK
## [1] "C6" "C5" "C7"
clustNonTNK <- rownames(cM)[grep(cNonTNK, preClust)]
clustNonTNK
## [1] "C4" "C2" "C10" "C8" "C12" "C11" "C13" "C1" "C9" "C3"
Identify scRNA-seq cells corresponding to the same cell types.
#RNA get cells in these categories
rnaTNK <- colnames(seRNA)[grep(cTNK, colData(seRNA)$BioClassification)]
head(rnaTNK)
## [1] "PBMC_10x_GREENLEAF_REP1:AAACCCAGTCGTCATA-1"
## [2] "PBMC_10x_GREENLEAF_REP1:AAACCCATCCGATGTA-1"
## [3] "PBMC_10x_GREENLEAF_REP1:AAACCCATCTCAACGA-1"
## [4] "PBMC_10x_GREENLEAF_REP1:AAACCCATCTCTCGAC-1"
## [5] "PBMC_10x_GREENLEAF_REP1:AAACGAACAATCGTCA-1"
## [6] "PBMC_10x_GREENLEAF_REP1:AAACGAACACGATTCA-1"
# get cell not in these categories
rnaNonTNK <- colnames(seRNA)[grep(cNonTNK, colData(seRNA)$BioClassification)]
head(rnaNonTNK)
## [1] "CD34_32_R5:AAACCTGAGTATCGAA-1" "CD34_32_R5:AAACCTGAGTCGTTTG-1"
## [3] "CD34_32_R5:AAACCTGGTTCCACAA-1" "CD34_32_R5:AAACGGGAGCTTCGCG-1"
## [5] "CD34_32_R5:AAACGGGAGGGAGTAA-1" "CD34_32_R5:AAACGGGAGTTACGGG-1"
We create a nested list with two vectors of cell IDs
groupList <- SimpleList(
TNK = SimpleList(
ATAC = proj$cellNames[proj$Clusters %in% clustTNK],
RNA = rnaTNK
),
NonTNK = SimpleList(
ATAC = proj$cellNames[proj$Clusters %in% clustNonTNK],
RNA = rnaNonTNK
)
)
groupList[[1]][[1]] %>% head
## [1] "scATAC_BMMC_R1#GTAGGAGCATTATGGC-1" "scATAC_BMMC_R1#CTAGGATTCTATGAGC-1"
## [3] "scATAC_BMMC_R1#AGTTTGGCACGCGCAT-1" "scATAC_BMMC_R1#GCAGATTGTATGGGTG-1"
## [5] "scATAC_BMMC_R1#ACATGGTCAGTATCTG-1" "scATAC_BMMC_R1#AGATTCGCATAGTCCA-1"
proj <- addGeneIntegrationMatrix(
ArchRProj = proj,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = FALSE,
groupList = groupList,
groupRNA = "BioClassification",
nameCell = "predictedCell_Co",
nameGroup = "predictedGroup_Co",
nameScore = "predictedScore_Co"
)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
pal <- paletteDiscrete(values = colData(seRNA)$BioClassification)
pal
## 01_HSC 02_Early.Eryth 03_Late.Eryth 04_Early.Baso 05_CMP.LMPP
## "#D51F26" "#502A59" "#235D55" "#3D6E57" "#8D2B8B"
## 06_CLP.1 07_GMP 08_GMP.Neut 09_pDC 10_cDC
## "#DE6C3E" "#F9B712" "#D8CE42" "#8E9ACD" "#B774B1"
## 11_CD14.Mono.1 12_CD14.Mono.2 13_CD16.Mono 14_Unk 15_CLP.2
## "#D69FC8" "#C7C8DE" "#8FD3D4" "#89C86E" "#CC9672"
## 16_Pre.B 17_B 18_Plasma 19_CD8.N 20_CD4.N1
## "#CF7E96" "#A27AA4" "#CD4F32" "#6B977E" "#518AA3"
## 21_CD4.N2 22_CD4.M 23_CD8.EM 24_CD8.CM 25_NK
## "#5A5297" "#0F707D" "#5E2E32" "#A95A3C" "#B28D5C"
## 26_Unk
## "#3D3D3D"
Plot the integration for unconstrained (left) and constrained (right) integration. We overlay the scRNA-seq cell types on the ATAC-seq data based on the unconstrained/constrained integration. In the plot below the difference is very subtle, because the cell types of interest are very distinct. There are subtle differences, for example in the T cell clusters. Could we also try this approach with constraining other cell types? Can we combine different constrains/groups?
df <- as_data_frame(cbind(getCellColData(proj), getEmbedding(proj)) ) %>%
rename(c(umap1 = IterativeLSI.UMAP_Dimension_1, umap2 = IterativeLSI.UMAP_Dimension_2))
p1 <- df %>% ggplot() +
geom_point(aes(x = umap1, y = umap2, col = predictedGroup_Un), size = .04) +
guides(col=guide_legend(title="predicted cell type")) +
xlab("umap1") +
ylab("umpa2") +
labs(title = "Predicted cell types after unconstrained integration")
p2 <- df %>% ggplot() +
geom_point(aes(x = umap1, y = umap2, col = predictedScore_Un), size = .04) +
guides(col=guide_legend(title="prediction score")) +
scale_color_viridis_c() +
xlab("umap1") +
ylab("umpa2") +
labs(title = "Predicted cell types after unconstrained integration")
p3 <- df %>% ggplot() +
geom_point(aes(x = umap1, y = umap2, col = predictedGroup_Co), size = .04) +
guides(col=guide_legend(title="predicted cell type")) +
#geom_label(predictedGroup_Co) +
xlab("umap1") +
ylab("umpa2") +
labs(title = "Predicted cell types after unconstrained integration")
p4<- df %>% ggplot() +
geom_point(aes(x = umap1, y = umap2, col = predictedScore_Co), size = .04) +
guides(col=guide_legend(title="prediction score")) +
scale_color_viridis_c() +
xlab("umap1") +
ylab("umpa2") +
labs(title = "Predicted cell types after constrained integration")
gridExtra::grid.arrange(p1, p3, ncol = 1)
p1 <- plotEmbedding(
proj,
colorBy = "cellColData",
name = "predictedGroup_Un",
pal = pal
)
p2 <- plotEmbedding(
proj,
colorBy = "cellColData",
name = "predictedGroup_Co",
pal = pal
)
ggAlignPlots(p1, p2, type = "h")
If we are happy with the integration results we add them to the Arrow Files. Then we can compare the linked gene expression with the inferred gene expression obtained through the gene scores.
#~5 minutes
proj <- addGeneIntegrationMatrix(
ArchRProj = proj,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = TRUE,
force= TRUE,
groupList = groupList,
groupRNA = "BioClassification",
nameCell = "predictedCell",
nameGroup = "predictedGroup",
nameScore = "predictedScore",
)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
getAvailableMatrices(proj)
## [1] "GeneIntegrationMatrix" "GeneScoreMatrix" "TileMatrix"
proj <- addImputeWeights(proj)
## Warning in sprintf("Completed Getting Magic Weights!",
## round(object.size(weightList)/10^9, : one argument not used by format 'Completed
## Getting Magic Weights!'
markerGenes <- c(
"CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", #B-Cell Trajectory
"CD14", #Monocytes
"CD3D", "CD8A", "TBX21", "IL7R" #TCells
)
# gene expression values from GeneIntegrationMatrix
p1 <- plotEmbedding(
ArchRProj = proj,
colorBy = "GeneIntegrationMatrix",
name = markerGenes,
continuousSet = "horizonExtra",
embedding = "UMAP",
imputeWeights = getImputeWeights(proj)
)
# Gene score values from GeneScoreMatrix
p2 <- plotEmbedding(
ArchRProj = proj,
colorBy = "GeneScoreMatrix",
continuousSet = "horizonExtra",
name = markerGenes,
embedding = "UMAP",
imputeWeights = getImputeWeights(proj)
)
In the plot below you can see the gene expression values from integration with the scRNA-seq data.
p1c <- lapply(p1, function(x){
x + guides(color = FALSE, fill = FALSE) +
theme_ArchR(baseSize = 6.5) +
theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()
)
})
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
p2c <- lapply(p2, function(x){
x + guides(color = FALSE, fill = FALSE) +
theme_ArchR(baseSize = 6.5) +
theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()
)
})
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
do.call(cowplot::plot_grid, c(list(ncol = 3), p1c))
do.call(cowplot::plot_grid, c(list(ncol = 3), p2c))
In the plot below you can see the gene score values which we calculate above. When comparing these values with the gene expression from above we can see that the gene scores calculated from scATAC-seq are more sensitive in comparison. This means that some regions are open (primed), but not yet transcribed. This is an information that you can only get with multi-omics data.
do.call(cowplot::plot_grid, c(list(ncol = 3), p2c))
The results from the two different methdos for inferring gene expression are similar, but not identical.
cm <- confusionMatrix(proj$Clusters, proj$predictedGroup)
cm[1:10, 1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
## 08_GMP.Neut 16_Pre.B 25_NK 07_GMP 04_Early.Baso
## C4 1 1 . 1 .
## C2 . 350 . . .
## C10 201 . . 241 1
## C6 . . 425 . .
## C8 3 . . 145 3
## C12 3 . . . 151
## C5 . . 4 . .
## C11 276 . . 133 51
## C7 . . 2 . .
## C13 12 . . 18 7
dim(cm)
## [1] 13 23
# for each row get the index of the maximum = predicted
# group which best defines a clsuter
apply(cm, 1, which.max)
## C4 C2 C10 C6 C8 C12 C5 C11 C7 C13 C1 C9 C3
## 9 2 4 3 13 8 6 7 11 19 12 10 20
# using these indices extract the corresponding predicted cell type
# we will use these predicted cell types as our new cluster labels
labelNew <- colnames(cm)[apply(cm, 1, which.max)]
labelNew
## [1] "17_B" "16_Pre.B" "07_GMP" "25_NK"
## [5] "11_CD14.Mono.1" "03_Late.Eryth" "20_CD4.N1" "05_CMP.LMPP"
## [9] "22_CD4.M" "01_HSC" "09_pDC" "12_CD14.Mono.2"
## [13] "15_CLP.2"
Rename the scRNA-seq cluster labels to something more interpretable:
remapClust <- c(
"01_HSC" = "Progenitor",
"02_Early.Eryth" = "Erythroid",
"03_Late.Eryth" = "Erythroid",
"04_Early.Baso" = "Basophil",
"05_CMP.LMPP" = "Progenitor",
"06_CLP.1" = "CLP",
"07_GMP" = "GMP",
"08_GMP.Neut" = "GMP",
"09_pDC" = "pDC",
"10_cDC" = "cDC",
"11_CD14.Mono.1" = "Mono",
"12_CD14.Mono.2" = "Mono",
"13_CD16.Mono" = "Mono",
"15_CLP.2" = "CLP",
"16_Pre.B" = "PreB",
"17_B" = "B",
"18_Plasma" = "Plasma",
"19_CD8.N" = "CD8.N",
"20_CD4.N1" = "CD4.N",
"21_CD4.N2" = "CD4.N",
"22_CD4.M" = "CD4.M",
"23_CD8.EM" = "CD8.EM",
"24_CD8.CM" = "CD8.CM",
"25_NK" = "NK"
)
remapClust <- remapClust[names(remapClust) %in% labelNew]
remapClust
## 01_HSC 03_Late.Eryth 05_CMP.LMPP 07_GMP 09_pDC
## "Progenitor" "Erythroid" "Progenitor" "GMP" "pDC"
## 11_CD14.Mono.1 12_CD14.Mono.2 15_CLP.2 16_Pre.B 17_B
## "Mono" "Mono" "CLP" "PreB" "B"
## 20_CD4.N1 22_CD4.M 25_NK
## "CD4.N" "CD4.M" "NK"
Using mapLabels() you can convert the labels to the new system
labelNew2 <- mapLabels(labelNew, oldLabels = names(remapClust), newLabels = remapClust)
labelNew2
## [1] "B" "PreB" "GMP" "NK" "Mono"
## [6] "Erythroid" "CD4.N" "Progenitor" "CD4.M" "Progenitor"
## [11] "pDC" "Mono" "CLP"
proj$Clusters2 <- mapLabels(proj$Clusters, newLabels = labelNew2, oldLabels = unique(proj$Clusters))
p1 <- plotEmbedding(proj, colorBy = "cellColData", name = "Clusters2")
p1
All of the above analysis is also a great starting point for more complex gene regulation analysis.
saveArchRProject(proj, outputDirectory = "Save_Vignette_object_1", load = FALSE)